#install.packages("plotly")
#install.packages("caret")
#install.packages("randomForest")
#install.packages("fastDummies")Final Project
Packages and installations
The following packages and libraries are need to run this analysis. Packages have been commented out to avoid errors
#Loading libraries, and suppressing messages
suppressMessages({
# Loading Necessary libraries
library(caret)
library(randomForest)
library(rio)
library(tidyverse)
library(ggplot2)
library(fastDummies)
library(plotly)
})
Part One: Data Analysis
For this section, I used the Global Burden of Disease (GBD) database. This database records disease burden across various risks and causes globally. It includes measures such as Disability-Adjusted Life Years (DALYs), which count years of healthy life lost, typically through premature death or living with disability due to illness or injury. Other measures include deaths, prevalence, and incidence. For my dataset, I gathered only data for Western Europe, as defined in the database, to show disease burden trends in the sub-region between 2015 and 2019 (for the parameters used to retrieve the data, see the image below). The dataset includes the following features:
Measure: Death, DALYs
Sex: Male, Female
Location: Country
Cause: Cause of burden, i.e. disease, or other risk factor
Age: Age group
Metric: Rate per 100,000, Number (i.e. total)
Year: ranges from 2015-2019
Upper: Confidence Interval upper bound for value
Lower: Confidence Interval lower bound for value
Figure 1: search terms used to retrieve the data from the GBD tool
Reference: Global Burden of Disease Collaborative Network. Global Burden of Disease Study 2019 (GBD 2019) Results. Seattle, United States: Institute for Health Metrics and Evaluation (IHME), 2020. Available from https://vizhub.healthdata.org/gbd-results/.
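To make the DALY definition above concrete, here is a minimal sketch using hypothetical numbers (the YLL and YLD values are invented for illustration; DALYs are the sum of years of life lost and years lived with disability):

```r
# Hypothetical illustration of the DALY definition: DALY = YLL + YLD
yll <- 12.5 # hypothetical years of life lost to premature death
yld <- 3.2  # hypothetical years lived with disability
daly <- yll + yld
daly # 15.7
```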
Task 1: Loading the data
diseaseBurden <- import("BurdenOfDisease.csv", setclass = "tibble")
str(diseaseBurden, width = 84, strict.width = "cut")
tibble [44,520 × 10] (S3: tbl_df/tbl/data.frame)
$ measure : chr [1:44520] "Deaths" "Deaths" "Deaths" "Deaths" ...
$ location: chr [1:44520] "Luxembourg" "Luxembourg" "Luxembourg" "Luxembourg" ...
$ sex : chr [1:44520] "Male" "Female" "Male" "Female" ...
$ age : chr [1:44520] "15-19 years" "15-19 years" "15-19 years" "15-19 years"..
$ cause : chr [1:44520] "Respiratory infections and tuberculosis" "Respiratory"..
$ metric : chr [1:44520] "Number" "Number" "Rate" "Rate" ...
$ year : int [1:44520] 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
$ val : num [1:44520] 0.0538 0.0476 0.3221 0.2989 0.2429 ...
$ upper : num [1:44520] 0.0745 0.066 0.4459 0.4147 0.3198 ...
$ lower : num [1:44520] 0.0375 0.0333 0.2244 0.2093 0.1781 ...
dimension <- dim(diseaseBurden)
paste("Total number of rows:", dimension[1],
"| Total number of columns:", dimension[2] )[1] "Total number of rows: 44520 | Total number of columns: 10"
The dataset diseaseBurden has been loaded successfully and has 2944 rows and 10 columns.
Task 2: Most Prevalent cause of death across all countries
In this section, I will show the ten most prevalent causes of death combined across all countries in Western Europe.
#Returns The most prevalent cause of death
prevalent_causes_of_death <- diseaseBurden %>%
filter(measure == "Deaths" & metric == "Number") %>%
group_by(cause) %>%
summarize(total_deaths = sum(val), .groups = "drop") %>%
arrange(desc(total_deaths)) %>%
top_n(10, total_deaths)
#Table to print most prevalent causes of death
knitr::kable(prevalent_causes_of_death,
caption = "Top Causes of Death and Total Deaths",
align = 'c', # Center align
format = "html")| cause | total_deaths |
|---|---|
| Ischemic heart disease | 2876567.5 |
| Stroke | 1512894.7 |
| Tracheal, bronchus, and lung cancer | 1137814.3 |
| Alzheimer's disease and other dementias | 1085739.2 |
| Diabetes and kidney diseases | 870192.4 |
| Respiratory infections and tuberculosis | 668875.8 |
| Breast cancer | 434000.3 |
| Cirrhosis and other chronic liver diseases | 329854.9 |
| Parkinson's disease | 231105.7 |
| Substance use disorders | 125715.6 |
From the table above, we can see the top cause of death across all countries is Ischemic heart disease, with a total of 2,876,567.5 deaths; Stroke is a distant second with 1,512,894.7 deaths. This shows that, across the countries selected, cardiovascular diseases (i.e. stroke and Ischemic heart disease) were by far the largest cause of mortality.
Task 3: DALYs by Countries
Disability Adjusted Life Years by countries.
dalys_by_country <- diseaseBurden %>%
filter(measure == "DALYs (Disability-Adjusted Life Years)" & metric == "Number") %>%
group_by(location) %>%
summarize(total_disability = sum(val), .groups = "drop") %>%
arrange(desc(total_disability))
knitr::kable(dalys_by_country,
caption = "Total DALYs by Country",
align = 'c', # Center align
format = "html")
| location | total_disability |
|---|---|
| Germany | 46486297.90 |
| United Kingdom | 30951586.89 |
| Italy | 28714135.20 |
| France | 25458595.43 |
| Netherlands | 6822461.83 |
| Greece | 6474363.70 |
| Belgium | 5287353.62 |
| Sweden | 4332065.64 |
| Austria | 4104396.85 |
| Switzerland | 3156503.80 |
| Finland | 2882039.89 |
| Denmark | 2654831.23 |
| Israel | 2386549.97 |
| Norway | 1921784.37 |
| Ireland | 1681611.26 |
| Cyprus | 471622.88 |
| Luxembourg | 217081.16 |
| Malta | 211720.02 |
| Iceland | 109442.98 |
| Andorra | 27610.93 |
| Monaco | 24322.39 |
The table above shows the total number of DALYs for each country in Western Europe as retrieved from the GBD database. We can see Germany is the country with the most DALYs, with the United Kingdom and Italy coming in second and third respectively. Monaco had the fewest DALYs.
Task 4: Country with Most Deaths
# Summarize total deaths by country
max_death_by_country_summary <- diseaseBurden |>
filter(measure == "Deaths" & metric == "Number") |>
group_by(location) |>
summarize(total_death = sum(val), .groups = "drop")
# Find the country with the maximum total deaths
max_death_index <- which.max(max_death_by_country_summary$total_death)
max_death_country <- max_death_by_country_summary[max_death_index, ]
paste("Country with the most deaths is:", max_death_country$location, "with",
      round(max_death_country$total_death, digits = 2), "deaths")
[1] "Country with the most deaths is: Germany with 2516815.9 deaths"
Germany again recorded the most deaths, with approximately 2.5 million (2,516,815.9) deaths over the period.
Task 5: Death Rate by Country
death_rate_by_country_year <- diseaseBurden|>
filter(measure == "Deaths" & metric == "Rate") |>
group_by(location, year) |>
summarize(death_rate = mean(val), .groups = "drop")
death_rate_by_country_year
# A tibble: 105 × 3
location year death_rate
<chr> <int> <dbl>
1 Andorra 2015 24.5
2 Andorra 2016 24.2
3 Andorra 2017 23.8
4 Andorra 2018 23.2
5 Andorra 2019 22.6
6 Austria 2015 32.6
7 Austria 2016 32.0
8 Austria 2017 31.4
9 Austria 2018 31.1
10 Austria 2019 31.1
# ℹ 95 more rows
library(ggplot2)
ggplot(death_rate_by_country_year, aes(x = location, y = death_rate)) +
geom_boxplot(outlier.color = "red", outlier.shape = 1) + # Highlight outliers
theme_minimal() +
labs(title = "Comparison of Death Rates Across Countries by Year",
x = "Country",
y = "Death Rate") +
theme(axis.text.x = element_text(angle = 45, hjust = 1), # Improve x-axis labels readability
plot.title = element_text(hjust = 0.5), # Center the title
legend.position = "none") # Remove legendfigure 1.0. The table above shows the average death rates per 100,000 people across all Western European Countries. We can see that although, Germany recorded the highest number of deaths between 2015-2019, when we consider the average death rates however, we can see Greece and Monaco had the highest death rate, with a median of about 42 deaths per 100,000 and 38 deaths per 100,000
Task 6: Death Rate By Sex and Age Group and Cause
daly_rate_by_sex_gender <- diseaseBurden|>
filter(measure == "DALYs (Disability-Adjusted Life Years)" & metric == "Rate") |>
group_by(age, sex) |>
summarize(daly_rate = mean(val), .groups = "drop")
ggplot(daly_rate_by_sex_gender, aes(x = age, y = daly_rate, fill = sex)) +
geom_bar(stat = "identity", position = position_dodge()) +
theme_minimal() +
labs(title = "DALY Rate by Age Group and Sex",
x = "Age Group",
y = "DALY Rate") +
scale_fill_brewer(palette = "Set1")
Figure 1.2: Average DALY rates across age groups, by sex. We can observe that as people get older, the number of healthy years lost to disability increases. We can also see a clear trend across age groups showing men disproportionately lose more years to disability, with the gap between the sexes widening as age increases.
Task 7: Death Rate By Disease
disease_death_rate_by_cause <- diseaseBurden|>
filter(measure == "Deaths" & metric == "Rate") |>
group_by(cause, year) |>
summarize(death_rate = mean(val), .groups = "drop")
ggplot(disease_death_rate_by_cause, aes(x = year, y = death_rate, color = cause)) +
geom_line() +
theme_minimal() +
labs(title = "Death Rate by Disease Over Years",
x = "Year",
y = "Death Rate") +
theme(legend.position = "right")
Figure 1.3: Average death rate per 100,000 by cause across all Western European countries. Ischemic heart disease caused by far the most deaths, at about 150 per 100,000.
Task 8: Percentage Change in mortality Between 2015-2019
In this section, I will examine the percentage change in mortality across each country in Western Europe between 2015 and 2019.
prevalent_causes_of_death_change <- diseaseBurden |>
filter(measure == "Deaths", metric == "Rate", year %in% c(2015, 2019)) |>
group_by(location, year) |>
summarize(total_deaths = mean(val), .groups = "drop") |>
pivot_wider(names_from = year, values_from = total_deaths) |>
mutate(percentage_change = round(((`2019` - `2015`) / `2015`) * 100, 2)) |>
arrange(desc(percentage_change))
knitr::kable(prevalent_causes_of_death_change,
caption = "Percentage Change in Death rates",
align = 'c', # Center align
format = "html")| location | 2015 | 2019 | percentage_change |
|---|---|---|---|
| Greece | 39.96816 | 44.31491 | 10.88 |
| Malta | 28.94198 | 29.90994 | 3.34 |
| Finland | 31.68815 | 32.72060 | 3.26 |
| Denmark | 29.94809 | 30.45171 | 1.68 |
| Netherlands | 26.03615 | 26.41456 | 1.45 |
| France | 26.64902 | 26.99418 | 1.30 |
| Israel | 26.01043 | 26.16816 | 0.61 |
| Italy | 31.18701 | 31.16974 | -0.06 |
| United Kingdom | 31.13327 | 31.03334 | -0.32 |
| Belgium | 31.95617 | 31.71683 | -0.75 |
| Germany | 34.05582 | 33.60944 | -1.31 |
| Iceland | 25.17477 | 24.79747 | -1.50 |
| Ireland | 27.17960 | 26.72863 | -1.66 |
| Cyprus | 28.19374 | 27.70192 | -1.74 |
| Sweden | 31.52218 | 30.85869 | -2.10 |
| Switzerland | 26.81358 | 25.82940 | -3.67 |
| Austria | 32.63258 | 31.08998 | -4.73 |
| Luxembourg | 26.58189 | 25.23345 | -5.07 |
| Norway | 27.90589 | 26.45179 | -5.21 |
| Monaco | 39.57359 | 37.02747 | -6.43 |
| Andorra | 24.48760 | 22.60774 | -7.68 |
Overall, the trend across Western Europe points towards falling death rates. Greece, however, had a 10.88% increase in death rate per 100,000, more than three times that of Malta, the country with the next-largest increase.
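The percentage changes in the table can be reproduced directly from the 2015 and 2019 rates; for example, using Greece's values from the table above:

```r
# Reproducing Greece's percentage change from the rates in the table
rate_2015 <- 39.96816
rate_2019 <- 44.31491
pct_change <- round((rate_2019 - rate_2015) / rate_2015 * 100, 2)
pct_change # 10.88
```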
Part 2: R Package
In this section, I will discuss and utilise an R package that was not used during the module, applying it to a new dataset. My package of choice is the Classification and Regression Training (caret) package. The package provides a wide range of functionality aimed at streamlining model training, model tuning, and performance evaluation. This functionality includes:
Data splitting
Pre-processing
Feature selection
Model tuning
Performance evaluation
I will demonstrate some of this functionality, with the aim of building and testing a machine learning model that can accurately classify whether a person has heart disease.
The dataset heart.csv was downloaded from Kaggle’s Heart Failure Prediction Dataset by FEDESORIANO.
Features:
Age: age of the patient [years]
Sex: sex of the patient [M: Male, F: Female]
ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
RestingBP: resting blood pressure [mm Hg]
Cholesterol: serum cholesterol [mm/dl]
FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes’ criteria]
MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
Oldpeak: oldpeak = ST [Numeric value measured in depression]
ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
HeartDisease: output class [1: heart disease, 0: Normal]
caret reference: Kuhn, M. (2008). Building Predictive Models in R Using the caret Package. Journal of Statistical Software, 28(5), 1–26. https://doi.org/10.18637/jss.v028.i05
Task 1: Loading the data
heart_data <- import("heart.csv", setclass = "tibble")
str(heart_data, width = 84, strict.width = "cut")
tibble [918 × 12] (S3: tbl_df/tbl/data.frame)
$ Age : int [1:918] 40 49 37 48 54 39 45 54 37 48 ...
$ Sex : chr [1:918] "M" "F" "M" "F" ...
$ ChestPainType : chr [1:918] "ATA" "NAP" "ATA" "ASY" ...
$ RestingBP : int [1:918] 140 160 130 138 150 120 130 110 140 120 ...
$ Cholesterol : int [1:918] 289 180 283 214 195 339 237 208 207 284 ...
$ FastingBS : int [1:918] 0 0 0 0 0 0 0 0 0 0 ...
$ RestingECG : chr [1:918] "Normal" "Normal" "ST" "Normal" ...
$ MaxHR : int [1:918] 172 156 98 108 122 170 170 142 130 120 ...
$ ExerciseAngina: chr [1:918] "N" "N" "N" "Y" ...
$ Oldpeak : num [1:918] 0 1 0 1.5 0 0 0 0 1.5 0 ...
$ ST_Slope : chr [1:918] "Up" "Flat" "Up" "Flat" ...
$ HeartDisease : int [1:918] 0 1 0 1 0 0 0 0 1 0 ...
Task 2: Creating Dummy Variables
Since some of my columns are of character type (i.e. categorical data), I will create dummy variables for each of them to help build the models. While some models can handle categorical data directly, encoding is still important: it keeps the outputs interpretable and improves the model's ability to make decisions and spot patterns. Binary categorical columns such as Sex will be encoded as a single 0/1 indicator (e.g. male = 1, female = 0). For multi-class categorical features such as ChestPainType, each class becomes a new indicator column, with 1 marking rows that belong to that class and 0 otherwise; the first dummy is dropped to avoid multicollinearity (the dummy-variable trap).
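As a toy illustration of the encoding described above (using an invented vector, not the heart dataset), one-hot encoding in base R with the first level dropped looks like this:

```r
# Toy sketch: one-hot encode a three-level variable, dropping the
# first (alphabetical) level to avoid the dummy-variable trap
pain <- c("ATA", "NAP", "ASY", "ATA") # hypothetical values
lvls <- sort(unique(pain))            # "ASY" "ATA" "NAP"
kept <- lvls[-1]                      # drop the first level
dummies <- sapply(kept, function(lvl) as.integer(pain == lvl))
dummies # columns ATA and NAP; a row of all zeros means "ASY"
```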
#for loop that checks if the column is of type chr, checks number of categories in column.
for (col in names(heart_data)) {
if (is.character(heart_data[[col]])) {
unique_values <- unique(heart_data[[col]])
num_unique_values <- length(unique_values)
if (num_unique_values == 2) {
# Mapping binary categorical variables to 0 and 1
heart_data[[col]] <- ifelse(heart_data[[col]] == unique_values[1], 1, 0)
cat(col, ": ", unique_values[1], "=1 ", unique_values[2], "=0\n")
} else if (num_unique_values > 2) {
# Creating dummy variables for columns with more than two unique values
heart_data <- dummy_cols(heart_data, select_columns = col, remove_first_dummy = TRUE, remove_selected_columns = TRUE)
}
}
}
Sex : M =1 F =0
ExerciseAngina : N =1 Y =0
# Converting the HeartDisease variable to a factor
heart_data$HeartDisease <- factor(heart_data$HeartDisease, levels = c(0, 1))
Task 3: Splitting the Dataset
One of the functionalities provided by caret is the train-test split (i.e. the hold-out technique). In the code below I split the data into training and testing sets in preparation for the modelling and testing phase. The hold-out technique allows us to test how well our model generalises to unseen data.
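createDataPartition draws a stratified sample, preserving the class balance of the outcome in both splits. The idea can be sketched in base R with hypothetical labels:

```r
# Sketch of stratified splitting: sample 70% within each class
set.seed(50)
y <- factor(rep(c(0, 1), times = c(40, 60))) # hypothetical labels, 40/60 split
train_idx <- unlist(lapply(split(seq_along(y), y),
                           function(i) sample(i, round(0.7 * length(i)))))
prop.table(table(y[train_idx])) # class proportions match the full data (0.4 / 0.6)
```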
#setting seed to ensure reproducibility
set.seed(50)
# Splitting data into training and testing sets
trainIndex <- createDataPartition(heart_data$HeartDisease, p = .7, list = FALSE, times = 1)
trainData <- heart_data[trainIndex, ]
testData <- heart_data[-trainIndex, ]
Task 4: Model Building and Testing with Hold-out and Evaluation
In this section we are going to test some functionalities of caret, such as training a model and evaluating the results. Since this is a classification task, I will use two models, a logistic regression and a random forest, and compare their results.
a. Logistic Regression
# Training a logistic regression model
logistic_model <- train(HeartDisease ~ ., data = trainData, method = "glm", family = "binomial")
# Summarising the model
print(logistic_model)
Generalized Linear Model
643 samples
15 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 643, 643, 643, 643, 643, 643, ...
Resampling results:
Accuracy Kappa
0.8654152 0.7248551
RESULTS: The logistic regression model has an accuracy of 86% on the training data.
# Predict on test data
predictions <- predict(logistic_model, testData)
# Evaluate model performance
confMatrix <- confusionMatrix(predictions, testData$HeartDisease)
print(confMatrix)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 103 23
1 20 129
Accuracy : 0.8436
95% CI : (0.7952, 0.8845)
No Information Rate : 0.5527
P-Value [Acc > NIR] : <2e-16
Kappa : 0.6845
Mcnemar's Test P-Value : 0.7604
Sensitivity : 0.8374
Specificity : 0.8487
Pos Pred Value : 0.8175
Neg Pred Value : 0.8658
Prevalence : 0.4473
Detection Rate : 0.3745
Detection Prevalence : 0.4582
Balanced Accuracy : 0.8430
'Positive' Class : 0
RESULTS: The logistic regression model seems to generalise well to unseen data, with an accuracy of 84% (95% CI: 79%, 88%).
True Positive (Prediction = 0, Reference = 0): 103
False Positive (Prediction = 0, Reference = 1): 23
False Negative (Prediction = 1, Reference = 0): 20
True Negative (Prediction = 1, Reference = 1): 129
The model shows a high accuracy and also a high Kappa value, suggesting substantial agreement. The balanced accuracy, sensitivity, and specificity are also high, indicating the model performs well in identifying both classes, although it might be slightly over-predicting the majority class. The confidence interval for accuracy suggests the model's performance is reliably high, and the model significantly outperforms the baseline of predicting the most frequent class. The similar rates of false positives and false negatives (as suggested by McNemar's test) indicate a balanced prediction capability for both classes.
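The headline metrics above can be reproduced from the four cell counts of the confusion matrix (remembering that caret has taken 0 as the positive class here):

```r
# Recomputing accuracy, sensitivity, and specificity from the cell counts
tp <- 103 # predicted 0, reference 0
fp <- 23  # predicted 0, reference 1
fn <- 20  # predicted 1, reference 0
tn <- 129 # predicted 1, reference 1
accuracy    <- (tp + tn) / (tp + fp + fn + tn)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
# accuracy 0.8436, sensitivity 0.8374, specificity 0.8487 -- matching the caret output
```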
B. Random Forest
# Training a random forest model
rf_model <- train(HeartDisease ~ ., data = trainData, method = "rf")
# Summarize the model
print(rf_model)
Random Forest
643 samples
15 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 643, 643, 643, 643, 643, 643, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.8823291 0.7601187
8 0.8634863 0.7220166
15 0.8468402 0.6883856
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
# Predict on test data
rf_predictions <- predict(rf_model, testData)
# Evaluate model performance
confMatrix <- confusionMatrix(rf_predictions, testData$HeartDisease)
print(confMatrix)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 98 18
1 25 134
Accuracy : 0.8436
95% CI : (0.7952, 0.8845)
No Information Rate : 0.5527
P-Value [Acc > NIR] : <2e-16
Kappa : 0.682
Mcnemar's Test P-Value : 0.3602
Sensitivity : 0.7967
Specificity : 0.8816
Pos Pred Value : 0.8448
Neg Pred Value : 0.8428
Prevalence : 0.4473
Detection Rate : 0.3564
Detection Prevalence : 0.4218
Balanced Accuracy : 0.8392
'Positive' Class : 0
RESULTS: The random forest results are as follows.
Resampling results: mtry is the number of predictors randomly sampled at each split of the random forest, and accuracy and Kappa are reported for each candidate value:
mtry = 2: Accuracy 0.8823, Kappa 0.7601 (selected as optimal)
mtry = 8: Accuracy 0.8635, Kappa 0.7220
mtry = 15: Accuracy 0.8468, Kappa 0.6884
Contingency table:
True Positive (TP, Prediction = 0, Reference = 0): 98
False Positive (FP, Prediction = 0, Reference = 1): 18
False Negative (FN, Prediction = 1, Reference = 0): 25
True Negative (TN, Prediction = 1, Reference = 1): 134
The model seems to generalise well to unseen data as well, with an accuracy of 84% (95% CI: 80%, 88%), and also shows a good Kappa value, indicating it performs much better than random chance. The optimal mtry value of 2 indicates that using just two randomly selected predictors at each split of each tree yields the best results in this scenario. The absence of a significant difference between false positives and false negatives (McNemar's test) points towards a balanced prediction capability for both classes.
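As a check on the reported Kappa, Cohen's kappa can be recomputed by hand from the random forest's confusion matrix above:

```r
# Recomputing Cohen's kappa from the confusion matrix cell counts
cm <- matrix(c(98, 25, 18, 134), nrow = 2) # rows: predicted 0/1, cols: reference 0/1
n  <- sum(cm)
po <- sum(diag(cm)) / n                    # observed agreement
pe <- sum(rowSums(cm) * colSums(cm)) / n^2 # agreement expected by chance
kappa <- (po - pe) / (1 - pe)
round(kappa, 3) # 0.682, matching the caret output
```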
Overall result: the two models tied on test-set accuracy, although the random forest performed slightly better than the logistic regression during resampling on the training data. However, when it comes to correctly identifying people without heart disease (the positive class here), the logistic regression is slightly better, having identified 103 true positives versus the random forest's 98.
Task 5: Cross validation
Another functionality of caret is cross-validation. This is a different technique for testing and validating the performance of models, and it is particularly good for small datasets. It works by splitting the data into k folds, training on k−1 folds and testing on the remaining fold, then repeating this for each fold and aggregating the results.
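The fold mechanics described above can be sketched directly (hypothetical indices; caret's trainControl handles this internally):

```r
# Minimal sketch of 5-fold cross-validation mechanics
set.seed(50)
k <- 5
n <- 20                                   # hypothetical number of rows
folds <- sample(rep(1:k, length.out = n)) # assign each row to one of k folds
for (i in 1:k) {
  test_rows  <- which(folds == i)
  train_rows <- setdiff(seq_len(n), test_rows)
  # fit on train_rows, score on test_rows; the k scores are then averaged
}
```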
control <- trainControl(method="cv", number=5)
# Defining hyper parameters
grid <- expand.grid(.mtry = 2:4)
# Training the model
RF_model <- train(HeartDisease ~ ., data = heart_data, method = "rf",
trControl = control, tuneGrid = grid)
# View the best parameters
print(RF_model$bestTune)
  mtry
2    3
print(RF_model)
Random Forest
918 samples
15 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 734, 734, 735, 734, 735
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.8682347 0.7321782
3 0.8704086 0.7369131
4 0.8704086 0.7368950
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 3.
Result: Using five-fold cross-validation, the random forest achieved an accuracy of about 87% (mtry = 3), compared with about 84% on the hold-out test set, giving a slightly more optimistic (and more stable) estimate of the model's generalisability.
Task 6: Selecting most important feature
We can also use the random forest model to rank features by importance and then use the most important ones to build a new model.
#selecting most immportant features from random forest
importance <- varImp(RF_model, scale = FALSE)
# printing feature importance
print(importance)
rf variable importance
Overall
ST_Slope_Up 70.958
ST_Slope_Flat 50.232
Oldpeak 45.845
Cholesterol 45.033
MaxHR 44.279
ExerciseAngina 35.739
Age 33.669
RestingBP 28.681
ChestPainType_ATA 17.758
Sex 17.383
ChestPainType_NAP 11.055
FastingBS 11.010
RestingECG_Normal 6.646
ChestPainType_TA 4.331
RestingECG_ST 4.021
# sorting features by importance
sorted_importance <- importance$importance[order(importance$importance[,1], decreasing = TRUE),]
print(sorted_importance)
 [1] 70.957990 50.231959 45.844564 45.033191 44.278543 35.739128 33.669170
[8] 28.680859 17.758330 17.383170 11.054866 11.009602 6.646193 4.330830
[15] 4.020986
Above we can see the most important (i.e. most informative) features, sorted in decreasing order.
# five of the most important features (Cholesterol, ranked fourth, is excluded here)
top_5_features <- c("ST_Slope_Up", "ST_Slope_Flat", "Oldpeak", "MaxHR", "ExerciseAngina")
# using top5 features subset of the dataset
subset_data <- heart_data[, c(top_5_features, "HeartDisease")]
# Setting up cross-validation controls
control <- trainControl(method = "cv", number = 5)
# Defining the hyperparameter grid
grid <- expand.grid(.mtry = 2:4)
# training the model using only the top 5 features
RF_model_top5 <- train(HeartDisease ~ ., data = subset_data, method = "rf",
trControl = control, tuneGrid = grid)
# printing the best parameters and the retrained model summary
print(RF_model_top5$bestTune)
  mtry
1    2
print(RF_model_top5)
Random Forest
918 samples
5 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 734, 735, 734, 734, 735
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.8322642 0.6582179
3 0.8202483 0.6362096
4 0.8039142 0.6033323
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
Result: Using only the selected important features did not improve our results; in fact this model recorded the lowest accuracy of all, at about 83%. However, it does point towards the importance of the selected features in correctly classifying whether a person has heart disease, given that five features recover most of the full model's performance.
Part 3: Functions/Programming
For this section, I will build an S3 class that uses a function and different methods to print details about the dataset, summarise the dataset, and plot a graph showing the number of seasons played versus the average number of goals or assists scored. Users can select either stat as input, but the code only allows one at a time. The dataset used, season_goals.csv, is from the TidyTuesday data collection, taken from the Hockey Goals dataset.
Features and Definitions:
| variable | class | description |
|---|---|---|
| rank | double | Overall goals ranking (1 - 250) |
| position | character | Position = player position (C = center, RW = Right Wing, LW = left Wing) |
| hand | character | Dominant hand (left or right) |
| player | character | Player name |
| years | character | Season years (year-yr) |
| total_goals | double | Total goals scored in career |
| status | character | Status = retired or active |
| yr_start | double | year started in NHL |
| season | character | Specific season for the player |
| age | double | Age during season |
| team | character | Team during season |
| league | character | League during season |
| season_games | double | Games played in the season |
| goals | double | Goals scored in the season |
| assists | double | Assists in the season |
| points | double | Points in the season |
| plus_minus | double | Plus Minus in the season - Team points minus opponents points scored while on ice |
| penalty_min | double | Penalty Minutes in the season |
| goals_even | double | Goals scored while even strength in a season |
| goals_power_play | double | Goals scored on powerplay in a season |
| goals_short_handed | double | Goals short handed in a season |
| goals_game_winner | double | Goals that were game winner in a season |
| headshot | character | Player headshot (URL to image of their head) |
Task 1: Loading Data
player_stat <- import("season_goals.csv", setclass = "tibble")
str(player_stat)
tibble [4,810 × 23] (S3: tbl_df/tbl/data.frame)
$ rank : int [1:4810] 1 1 1 1 1 1 1 1 1 1 ...
$ position : chr [1:4810] "C" "C" "C" "C" ...
$ hand : chr [1:4810] "Left" "Left" "Left" "Left" ...
$ player : chr [1:4810] "Wayne Gretzky" "Wayne Gretzky" "Wayne Gretzky" "Wayne Gretzky" ...
$ years : chr [1:4810] "1979-99" "1979-99" "1979-99" "1979-99" ...
$ total_goals : int [1:4810] 894 894 894 894 894 894 894 894 894 894 ...
$ status : chr [1:4810] "Retired" "Retired" "Retired" "Retired" ...
$ yr_start : int [1:4810] 1979 1979 1979 1979 1979 1979 1979 1979 1979 1979 ...
$ season : chr [1:4810] "1978-79" "1978-79" "1978-79" "1979-80" ...
$ age : int [1:4810] 18 18 18 19 20 21 22 23 24 25 ...
$ team : chr [1:4810] "TOT" "INR" "EDO" "EDM" ...
$ league : chr [1:4810] "WHA" "WHA" "WHA" "NHL" ...
$ season_games : int [1:4810] 80 8 72 79 80 80 80 74 80 80 ...
$ goals : int [1:4810] 46 3 43 51 55 92 71 87 73 52 ...
$ assists : int [1:4810] 64 3 61 86 109 120 125 118 135 163 ...
$ points : int [1:4810] 110 6 104 137 164 212 196 205 208 215 ...
$ plus_minus : int [1:4810] 20 -3 23 14 41 80 61 78 100 71 ...
$ penalty_min : int [1:4810] 19 0 19 21 28 26 59 39 52 46 ...
$ goals_even : int [1:4810] NA 3 34 37 36 68 47 55 54 38 ...
$ goals_power_play : int [1:4810] NA 0 9 13 15 18 18 20 8 11 ...
$ goals_short_handed: int [1:4810] NA 0 0 1 4 6 6 12 11 3 ...
$ goals_game_winner : int [1:4810] NA NA NA 6 3 12 9 11 7 6 ...
$ headshot : chr [1:4810] "https://d9kjk42l7bfqz.cloudfront.net/req/201912031/images/headshots/gretzwa01.jpg" "https://d9kjk42l7bfqz.cloudfront.net/req/201912031/images/headshots/gretzwa01.jpg" "https://d9kjk42l7bfqz.cloudfront.net/req/201912031/images/headshots/gretzwa01.jpg" "https://d9kjk42l7bfqz.cloudfront.net/req/201912031/images/headshots/gretzwa01.jpg" ...
Task 2: Defining the S3 Class
# Defining the S3 class
hockeyStats <- function() {
structure(list(data = NULL, analysis = NULL, plot = NULL), class = "hockeyStats")
}
The code above defines the S3 class I will use to analyse the number of goals or assists scored by each player in the NHL. The data field is intended to hold the dataset, analysis holds the analysis result, and plot is meant for a graphical representation.
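For readers new to S3, method dispatch works by naming a function generic.class; a toy example (the "greeting" class here is hypothetical and unrelated to hockeyStats):

```r
# Toy S3 dispatch: print() on a "greeting" object finds print.greeting
g <- structure(list(msg = "hello"), class = "greeting")
print.greeting <- function(x, ...) cat("Greeting: ", x$msg, "\n", sep = "")
print(g) # Greeting: hello
```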
Task 3: Function for implementing data analysis
# method for data analysis
analyzeHockeyData <- function(data, stat = "goals") {
# Check for mandatory columns
mandatory_columns <- c(stat, "player", "position", "season")
missing_columns <- setdiff(mandatory_columns, names(data))
if (length(missing_columns) > 0) {
stop("Data must have the following features: ", paste(missing_columns, collapse = ", "))
}
# Calculates the average stat (i.e. goals or assists) and number of seasons
avg_stat <- aggregate(reformulate("player", stat), data = data, mean)
names(avg_stat)[2] <- "avg_stat"
num_seasons <- data |> group_by(player) |> summarise(seasons = n_distinct(season))
# Merging the data
analysis_data <- merge(avg_stat, num_seasons, by = "player")
analysis_data <- merge(analysis_data, data[, c("player", "position")], by = "player")
# Store results in hockeyStats object
stats <- hockeyStats()
stats$data <- data
stats$analysis <- analysis_data
return(stats)
}
The analyzeHockeyData function analyses a hockey dataset, focusing on a player statistic (i.e. goals or assists) and calculating the average for each player in the dataset. It also calculates the number of seasons played, checks for the required columns, and returns a hockeyStats object with the player averages.
Task 4: Printing, Summary and Plotting methods
# Printing method
print.hockeyStats <- function(x) {
print(paste("Hockey Data Analysis with", nrow(x$data), "records"))
}
registerS3method("print", "hockeyStats", print.hockeyStats)# Summary method
summary.hockeyStats <- function(object) {
list(
AverageStat = summary(object$analysis$avg_stat),
SeasonsPlayed = summary(object$analysis$seasons)
)
}
registerS3method("summary", "hockeyStats", summary.hockeyStats)the following code is defines a method which takes the user inputed stat and returns a graphical plot showing
# Plot method with plotly for interactive plot
plot.hockeyStats <- function(x) {
p <- ggplot(x$analysis, aes(x = seasons, y = avg_stat, text = player, color = position)) +
geom_point() +
labs(title = "Average Stat vs. Seasons Played", x = "Average Stat", y = "Seasons Played")
# making plot interactive
ggplotly(p, tooltip = "text")
}
registerS3method("plot", "hockeyStats", plot.hockeyStats)The
print methodentails how objects are printed, it displays a message with number of records in the datasetThe
Summary Methodprovides summary ofhockeyStatobjects, mean, median and mode of selected stats and number of seasons playedThe
Plot MethodCreates an interactive plot which shows average statistic vs seasons played, users can see players name when they over around a data point
Task 5: Results
result <- analyzeHockeyData(player_stat, stat = "goals") # can switch between goals and assists
print(result)
[1] "Hockey Data Analysis with 4810 records"
The total number of records in the dataset is 4810 (i.e. rows).
summary(result)
$AverageStat
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.93 18.06 21.09 22.50 26.38 57.30
$SeasonsPlayed
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.00 15.00 17.00 17.24 19.00 32.00
The results above show descriptive statistics for the selected stat and for seasons played; we can see the minimum, maximum, median, mean, and interquartile range.
# Plotting
plot(result)
Figure 3.1: Average value of the selected stat versus the number of seasons played by each player. Users can hover over each point to see player names.